{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 第四节 检索增强生成的原理与实现（Retrieval Augmented Generation，RAG）\n",
    "\n",
    "\n",
    "检索增强生成（Retrieval Augmented Generation），简称 RAG，已经成为当前最火热的LLM应用方案。它是一个为大模型提供外部知识源的概念，这使它们能够生成准确且符合上下文的答案，同时能够减少模型幻觉。\n",
    "\n",
    "RAG结合了检索（Retrieval）和生成（Generation）两大核心技术，通过这种独特的混合机制，能够在处理复杂的查询和生成任务时，提供更加准确、丰富的信息。无论是在回答复杂的问题，还是在创作引人入胜的故事，RAG都展现了其不可小觑的能力。\n",
    "\n",
    "* 数据准备阶段：数据提取 >> 文本分割 >> 向量化（embedding) >> 数据入库\n",
    "\n",
    "* 应用阶段：用户提问 >> 数据检索（召回） >> 注入Prompt >> LLM生成答案|\n",
    "\n",
    "本课件主要参考了如下两个链接：\n",
    "\n",
    "https://datawhalechina.github.io/llm-universe/#/C4/1.%20%E7%9F%A5%E8%AF%86%E5%BA%93%E6%96%87%E6%A1%A3%E5%A4%84%E7%90%86\n",
    "\n",
    "https://blog.csdn.net/qq_43548590/article/details/135512188\n",
    "\n",
    "其他参考资料：\n",
    "\n",
    "https://blog.csdn.net/Julialove102123/article/details/135714213\n",
    "\n",
    "https://www.zhihu.com/tardis/bd/art/656646499?source_id=1001\n",
    "\n",
    "https://zhuanlan.zhihu.com/p/640936557"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "#导入语言模型\n",
    "import os\n",
    "from langchain_community.llms import Tongyi\n",
    "from langchain_community.llms import SparkLLM\n",
    "from langchain_community.llms import QianfanLLMEndpoint\n",
    "\n",
    "import pandas as pd\n",
    "#导入模版\n",
    "from langchain.prompts import PromptTemplate\n",
    "\n",
    "#导入聊天模型\n",
    "from langchain.prompts.chat import (\n",
    "    ChatPromptTemplate,\n",
    "    SystemMessagePromptTemplate,\n",
    "    AIMessagePromptTemplate,\n",
    "    HumanMessagePromptTemplate,\n",
    ")\n",
    "from langchain.schema import (\n",
    "    AIMessage,\n",
    "    HumanMessage,\n",
    "    SystemMessage\n",
    ")\n",
    "\n",
    "from langchain_community.chat_models import ChatSparkLLM\n",
    "from langchain_community.chat_models.tongyi import ChatTongyi\n",
    "from langchain_community.chat_models import QianfanChatEndpoint\n",
    "\n",
    "#输入三个模型各自的key\n",
    "\n",
    "os.environ[\"DASHSCOPE_API_KEY\"] = \"\"\n",
    "\n",
    "os.environ[\"IFLYTEK_SPARK_APP_ID\"] = \"\"\n",
    "os.environ[\"IFLYTEK_SPARK_API_KEY\"] = \"\"\n",
    "os.environ[\"IFLYTEK_SPARK_API_SECRET\"] = \"\"\n",
    "\n",
    "os.environ[\"QIANFAN_AK\"] = \"\"\n",
    "os.environ[\"QIANFAN_SK\"] = \"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "model_ty = Tongyi(temperature=0.1)\n",
    "model_qf = QianfanLLMEndpoint(temperature=0.1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 文档的读入\n",
    "\n",
    "### pdf文档的读入"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.document_loaders import PyPDFLoader"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Multiple definitions in dictionary at byte 0x32502b for key /MediaBox\n"
     ]
    }
   ],
   "source": [
    "loader_pdf = PyPDFLoader(r\"G:\\迅雷下载\\基于文本数据的汽车造型需求分析_苏佳幸.pdf\")\n",
    "docs_pdf = loader_pdf.load()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "list"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type(docs_pdf)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "4"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(docs_pdf)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "langchain_core.documents.base.Document"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type(docs_pdf[0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'source': 'G:\\\\迅雷下载\\\\基于文本数据的汽车造型需求分析_苏佳幸.pdf', 'page': 0}"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "docs_pdf[0].metadata"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'150  AUTO TIME\\nAUTO AFTERMARKET  | 汽车后市场\\n1\\u3000引言\\n随着国家经济的发展，人民的生活品质\\n和水平的提高和购买力的增强，汽车的销量\\n飞增。国家统计局数据显示，汽车制造业生产较 2021年同比增长 5.5％\\n[1]，虽然汽车行\\n业发展迅速，但考虑长远发展仍需推动国内汽车行业深层次发展，本文介绍的文本数据法依托消费者的评价进行分析，促进更高效的产品生产，帮助企业占据市场有利地位。\\n数字化用户画像是指借助大数据来提取\\n用户的相应特征、需求、偏好等内容并建模分析。\\n[2]随着社交媒体如微博、论坛等逐渐\\n成熟，能够最直观反映用户体验的各种评价不断更新，传统的人工挖掘方式也变得更加困难、效率低下，并且用户评论中含有大量零碎且多样的情感词汇，难以分析。，本文对国内学者季曹婷等人提出的融合多特征TFIDF文本分析的汽车造型需求提取方法和余本功等人提出的在解决问答社区关键词提取问题的想法进行融合，提出了新的方法\\n[3]，\\n具体流程见图 1。 \\n虽然现如今国内对于该方面的研究较少，\\n但不难看出将大数据用户画像与汽车造型相融合能够使汽车行业在未来更加繁荣。\\n2\\u3000研究方法\\n此次研究采用了内容分析法[4]。主要通\\n过对特定文本中单词和词组的频率计数进行，将定性的文本数据转化为定量的频数。这种方法真实、客观、全面地反映文本内容的本来意义，具有一定的深度。内容分析法经过选择、分类、统计等三个阶段，以爬取搜集的网络评论文本为分析内容，对数据进行预处理，评论进行分句，删减无关、重复评论，得到筛选清理的评论。再对其进行词语提取，将描述性词语及情感词提取出来，经过统计、排序后，绘制得到高频词汇表，依据高频词汇表解读、判断和挖掘信息中所蕴含的本质内容。苏佳幸\\u3000李伽熙\\u3000李睿思\\u3000李俊豪\\u3000赵云芸\\n通讯作者\\n武汉商学院\\u3000湖北省武汉市\\u3000 430100\\n摘\\u2003 要：  近年来互联网的快速发展，汽车业已在网络上建立起了自己的“生态圈”，社交媒体逐渐成熟，传统\\n的人工挖掘汽车评论的方式显得效率低下。本文运用网络信息搜集技术对用户评论的文本数据进行搜\\n集，并基于此对消费者汽车造型需求展开分析，有效地弥补传统方法的缺陷，获得评价的潜在价值和情绪信息，提炼出汽车需求的关键要素，并据此提供给企业相关建议，同时也让消费者对车主使用满意程度有所了解，为新消费者购买汽车提供依据。\\n关键词： 汽车\\u3000用户'"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "docs_pdf[0].page_content[0:1000]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### txt文档的读入"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.document_loaders import TextLoader\n",
    "loader_txt = TextLoader(r'G:\\云岚宗.txt', encoding='utf8')\n",
    "docs_txt = loader_txt.load()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(docs_txt)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'source': 'G:\\\\云岚宗.txt'} 第四章 云岚宗\n",
      "\n",
      "大厅中,萧战以及三位长老,正在颇为热切的与那位陌生老者交谈着,不过这位老者似乎有什么难以启齿的事情一般,每每到口的话语,都将会有些无奈的咽了回去,而每当这个时候,一旁的娇贵少女,都是\n"
     ]
    }
   ],
   "source": [
    "print(docs_txt[0].metadata,docs_txt[0].page_content[0:100])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 网页文档的读入\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_community.document_loaders import WebBaseLoader\n",
    "WEB_URL = \"https://news.ifeng.com/c/8Y3TlIcTsj0\"\n",
    "loader_html = WebBaseLoader(WEB_URL)\n",
    "docs_html = loader_html.load()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1"
      ]
     },
     "execution_count": 44,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(docs_html)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'source': 'https://news.ifeng.com/c/8Y3TlIcTsj0', 'title': '足协原副主席王登峰被判17年_凤凰网', 'description': '足协原副主席王登峰被判17年', 'language': 'zh'} 足协原副主席王登峰被判17年_凤凰网\n",
      "\n",
      "\n",
      "\n",
      "首页资讯视频直播凤凰卫视财经娱乐体育时尚汽车房产科技读书文化历史军事旅游佛教更多国学数码健康家居公益教育酒业美食资讯 > 大陆 > 正文足协原副主席王登峰被\n"
     ]
    }
   ],
   "source": [
    "print(docs_html[0].metadata,docs_html[0].page_content[0:100])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 文档的分割\n",
    "\n",
    "Langchain 中文本分割器都根据 chunk_size (块大小)和 chunk_overlap (块与块之间的重叠大小)进行分割。\n",
    "\n",
    "* chunk_size 指每个块包含的字符或 Token（如单词、句子等）的数量\n",
    "\n",
    "* chunk_overlap 指两个块之间共享的字符数量，用于保持上下文的连贯性，避免分割丢失上下文信息\n",
    "\n",
    "Langchain 提供多种文档分割方式，区别在怎么确定块与块之间的边界、块由哪些字符/token组成、以及如何测量块大小\n",
    "\n",
    "RecursiveCharacterTextSplitter(): 按字符串分割文本，递归地尝试按不同的分隔符进行分割文本。\n",
    "\n",
    "CharacterTextSplitter(): 按字符来分割文本。\n",
    "\n",
    "MarkdownHeaderTextSplitter(): 基于指定的标题来分割markdown 文件。\n",
    "\n",
    "TokenTextSplitter(): 按token来分割文本。\n",
    "\n",
    "SentenceTransformersTokenTextSplitter(): 按token来分割文本。\n",
    "\n",
    "Language(): 用于 CPP、Python、Ruby、Markdown 等。\n",
    "\n",
    "NLTKTextSplitter(): 使用 NLTK（自然语言工具包）按句子分割文本。\n",
    "\n",
    "SpacyTextSplitter(): 使用 Spacy按句子的切割文本。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "''' \n",
    "* RecursiveCharacterTextSplitter 递归字符文本分割\n",
    "RecursiveCharacterTextSplitter 将按不同的字符递归地分割(按照这个优先级[\"\\n\\n\", \"\\n\", \" \", \"\"])，\n",
    "    这样就能尽量把所有和语义相关的内容尽可能长时间地保留在同一位置\n",
    "RecursiveCharacterTextSplitter需要关注的是4个参数：\n",
    "\n",
    "* separators - 分隔符字符串数组\n",
    "* chunk_size - 每个文档的字符数量限制\n",
    "* chunk_overlap - 两份文档重叠区域的长度\n",
    "* length_function - 长度计算函数\n",
    "'''\n",
    "#导入文本分割器\n",
    "from langchain.text_splitter import RecursiveCharacterTextSplitter"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 导入递归字符文本分割器\n",
    "text_splitter_txt = RecursiveCharacterTextSplitter(chunk_size = 384, chunk_overlap = 0, separators=[\"\\n\\n\", \"\\n\", \" \", \"\", \"。\", \"，\"])\n",
    "# 导入文本\n",
    "documents_txt = text_splitter_txt.split_documents(docs_txt)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "7"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(documents_txt)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'第四章 云岚宗\\n\\n大厅中,萧战以及三位长老,正在颇为热切的与那位陌生老者交谈着,不过这位老者似乎有什么难以启齿的事情一般,每每到口的话语,都将会有些无奈的咽了回去,而每当这个时候,一旁的娇贵少女,都是忍不住的横了老者一眼…\\n\\n    倾耳听了一会，萧炎便是有些无聊的摇了摇头…\\n\\n    “萧炎哥哥，你知道他们的身份吗？”就在萧炎无聊得想要打瞌睡之时，身旁的熏儿，纤指再次翻开古朴的书页，目不斜视的微笑道。\\n\\n    “你知道？”好奇的转过头来，萧炎惊诧的问道。\\n\\n    “看见他们袍服袖口处的云彩银剑了么？”微微一笑，熏儿道。\\n\\n    “哦？”心头一动，萧炎目光转向三人袖口，果然是发现了一道云彩形状的银剑。\\n\\n    “他们是云岚宗的人？”萧炎惊讶的低声道。'"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "documents_txt[0].page_content"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2285"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#计算字数\n",
    "sum([len(doc.page_content) for doc in documents_txt])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### classwork 1\n",
    "\n",
    "1. 实现pdf文档的导入，并对其进行分割，输出分割后的段数与某段的内容\n",
    "\n",
    "2. 实现在线新闻的导入，并对其进行分割，输出分割后的段数与某段的内容"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 文档词向量化与向量数据库\n",
    "\n",
    "在机器学习和自然语言处理（NLP）中，Embeddings（嵌入）是一种将类别数据，如单词、句子或者整个文档，转化为实数向量的技术。这些实数向量可以被计算机更好地理解和处理。嵌入背后的主要想法是，相似或相关的对象在嵌入空间中的距离应该很近。\n",
    "\n",
    "举个例子，我们可以使用词嵌入（word embeddings）来表示文本数据。在词嵌入中，每个单词被转换为一个向量，这个向量捕获了这个单词的语义信息。例如，\"king\" 和 \"queen\" 这两个单词在嵌入空间中的位置将会非常接近，因为它们的含义相似。而 \"apple\" 和 \"orange\" 也会很接近，因为它们都是水果。而 \"king\" 和 \"apple\" 这两个单词在嵌入空间中的距离就会比较远，因为它们的含义不同。\n",
    "\n",
    "让我们取出我们的切分部分并对它们进行 Embedding 处理。\n",
    "\n",
    "langchain官网提供了很多向量化的模型：\n",
    "https://python.langchain.com/docs/integrations/text_embedding\n",
    "\n",
    "这里我们使用其中的两种\n",
    "\n",
    "1. baidu_qianfan_endpoint\n",
    "\n",
    "https://python.langchain.com/docs/integrations/text_embedding/baidu_qianfan_endpoint\n",
    "\n",
    "2. Hugging Face\n",
    "\n",
    "https://python.langchain.com/docs/integrations/text_embedding/huggingfacehub\n",
    "\n",
    "\n",
    "### 百度提供的向量化接口（key只要导入之前大模型的就可以了）\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_community.embeddings import QianfanEmbeddingsEndpoint\n",
    "import numpy as np\n",
    "\n",
    "embeddings_qf = QianfanEmbeddingsEndpoint()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[INFO] [03-19 09:42:51] openapi_requestor.py:316 [t:3856]: requesting llm api endpoint: /embeddings/embedding-v1\n",
      "[INFO] [03-19 09:42:51] oauth.py:207 [t:3856]: trying to refresh access_token for ak `og6mWr***`\n",
      "[INFO] [03-19 09:42:52] oauth.py:220 [t:3856]: sucessfully refresh access_token\n",
      "[INFO] [03-19 09:42:52] openapi_requestor.py:316 [t:3856]: requesting llm api endpoint: /embeddings/embedding-v1\n",
      "[INFO] [03-19 09:42:53] openapi_requestor.py:316 [t:3856]: requesting llm api endpoint: /embeddings/embedding-v1\n"
     ]
    }
   ],
   "source": [
    "query1 = \"狗\"\n",
    "query2 = \"猫\"\n",
    "query3 = \"雨\"\n",
    "\n",
    "# 通过对应的 embedding 类生成 query 的 embedding。\n",
    "emb1 = embeddings_qf.embed_query(query1)\n",
    "emb2 = embeddings_qf.embed_query(query2)\n",
    "emb3 = embeddings_qf.embed_query(query3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.47321960541829167\n",
      "0.3674960308371415\n",
      "0.21290156815031291\n"
     ]
    }
   ],
   "source": [
    "# 将返回结果转成 numpy 的格式，便于后续计算\n",
    "emb1 = np.array(emb1)\n",
    "emb2 = np.array(emb2)\n",
    "emb3 = np.array(emb3)\n",
    "\n",
    "print(np.dot(emb1, emb2))\n",
    "print(np.dot(emb3, emb2))\n",
    "print(np.dot(emb1, emb3))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 离线的向量化模型"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 第一次运行会下载半个G的模型\n",
    "from langchain.embeddings import HuggingFaceEmbeddings\n",
    "embeddings_hf = HuggingFaceEmbeddings(model_name=\"moka-ai/m3e-base\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "320.0446771489675\n",
      "302.9192630617566\n",
      "300.0346646835118\n"
     ]
    }
   ],
   "source": [
    "query1 = \"狗\"\n",
    "query2 = \"猫\"\n",
    "query3 = \"雨\"\n",
    "\n",
    "# 通过对应的 embedding 类生成 query 的 embedding。\n",
    "emb1 = embeddings_hf.embed_query(query1)\n",
    "emb2 = embeddings_hf.embed_query(query2)\n",
    "emb3 = embeddings_hf.embed_query(query3)\n",
    "\n",
    "# 将返回结果转成 numpy 的格式，便于后续计算\n",
    "emb1 = np.array(emb1)\n",
    "emb2 = np.array(emb2)\n",
    "emb3 = np.array(emb3)\n",
    "\n",
    "print(np.dot(emb1, emb2))\n",
    "print(np.dot(emb3, emb2))\n",
    "print(np.dot(emb1, emb3))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 向量数据库（Chroma）\n",
    "\n",
    "\n",
    "向量数据库是用于高效计算和管理大量向量数据的解决方案。向量数据库是一种专门用于存储和检索向量数据（embedding）的数据库系统。它与传统的基于关系模型的数据库不同，它主要关注的是向量数据的特性和相似性。\n",
    "\n",
    "在向量数据库中，数据被表示为向量形式，每个向量代表一个数据项。这些向量可以是数字、文本、图像或其他类型的数据。向量数据库使用高效的索引和查询算法来加速向量数据的存储和检索过程。\n",
    "\n",
    "Langchain 集成了超过 30 个不同的向量存储库。我们选择 Chroma 是因为它轻量级且数据存储在内存中，这使得它非常容易启动和开始使用。\n",
    "\n",
    "langchain支持列表\n",
    "\n",
    "https://python.langchain.com/docs/integrations/vectorstores\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 数据库的创建与载入"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#! pip install chromadb -i https://mirrors.aliyun.com/pypi/simple"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "#help(Chroma.from_documents)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[INFO] [03-19 09:44:12] openapi_requestor.py:316 [t:3856]: requesting llm api endpoint: /embeddings/embedding-v1\n"
     ]
    }
   ],
   "source": [
    "from langchain.vectorstores import Chroma\n",
    "# 导入千帆向量模型\n",
    "embeddings_qf = QianfanEmbeddingsEndpoint()\n",
    "\n",
    "#persist_directory允许我们将目录保存到磁盘上\n",
    "#注意from_documents每次运行都是把数据添加进去\n",
    "vectordb = Chroma.from_documents(documents=documents_txt, embedding=embeddings_qf, persist_directory=\"G:\\\\\" )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "vectordb.persist()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "vectordb_load = Chroma(\n",
    "    persist_directory=\"G:\\\\\",\n",
    "    embedding_function=embeddings_qf\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "14"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "vectordb_load._collection.count()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 通过向量数据库检索\n",
    "\n",
    "##### 相似度检索"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[INFO] [03-19 09:44:34] openapi_requestor.py:316 [t:3856]: requesting llm api endpoint: /embeddings/embedding-v1\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[Document(page_content='第四章 云岚宗\\n\\n大厅中,萧战以及三位长老,正在颇为热切的与那位陌生老者交谈着,不过这位老者似乎有什么难以启齿的事情一般,每每到口的话语,都将会有些无奈的咽了回去,而每当这个时候,一旁的娇贵少女,都是忍不住的横了老者一眼…\\n\\n    倾耳听了一会，萧炎便是有些无聊的摇了摇头…\\n\\n    “萧炎哥哥，你知道他们的身份吗？”就在萧炎无聊得想要打瞌睡之时，身旁的熏儿，纤指再次翻开古朴的书页，目不斜视的微笑道。\\n\\n    “你知道？”好奇的转过头来，萧炎惊诧的问道。\\n\\n    “看见他们袍服袖口处的云彩银剑了么？”微微一笑，熏儿道。\\n\\n    “哦？”心头一动，萧炎目光转向三人袖口，果然是发现了一道云彩形状的银剑。\\n\\n    “他们是云岚宗的人？”萧炎惊讶的低声道。', metadata={'source': 'G:\\\\云岚宗.txt'}),\n",
       " Document(page_content='第四章 云岚宗\\n\\n大厅中,萧战以及三位长老,正在颇为热切的与那位陌生老者交谈着,不过这位老者似乎有什么难以启齿的事情一般,每每到口的话语,都将会有些无奈的咽了回去,而每当这个时候,一旁的娇贵少女,都是忍不住的横了老者一眼…\\n\\n    倾耳听了一会，萧炎便是有些无聊的摇了摇头…\\n\\n    “萧炎哥哥，你知道他们的身份吗？”就在萧炎无聊得想要打瞌睡之时，身旁的熏儿，纤指再次翻开古朴的书页，目不斜视的微笑道。\\n\\n    “你知道？”好奇的转过头来，萧炎惊诧的问道。\\n\\n    “看见他们袍服袖口处的云彩银剑了么？”微微一笑，熏儿道。\\n\\n    “哦？”心头一动，萧炎目光转向三人袖口，果然是发现了一道云彩形状的银剑。\\n\\n    “他们是云岚宗的人？”萧炎惊讶的低声道。', metadata={'source': 'G:\\\\云岚宗.txt'}),\n",
       " Document(page_content='虽然并没有外出历练，不过萧炎在一些书籍中却看过有关这剑派的资料，萧家所在的城市名为乌坦城，乌坦城隶属于加玛帝国，虽然此城因为背靠魔兽山脉的地利，而跻身进入帝国的大城市之列，不过也仅仅只是居于末座。\\n\\n    萧炎的家族，在乌坦城颇有份量，不过却也并不是唯一，城市中，还有另外两大家族实力与萧家相差无几，三方彼此明争暗斗了几十年，也未曾分出胜负…\\n\\n    如果说萧家是乌坦城的一霸，那么萧炎口中所说的云岚宗，或许便应该说是整个加玛帝国的一霸！这之间的差距，犹如鸿沟，也难怪连平日严肃的父亲，在言语上很是敬畏。\\n\\n    “他们来我们家族做什么？”萧炎有些疑惑的低声询问道。\\n\\n    移动的纤细指尖微微一顿，熏儿沉默了一会，方才道：“或许和萧炎哥哥有关…”\\n\\n    “我？我可没和他们有过什么交集啊？”闻言，萧炎一怔，摇头否认。', metadata={'source': 'G:\\\\云岚宗.txt'}),\n",
       " Document(page_content='虽然并没有外出历练，不过萧炎在一些书籍中却看过有关这剑派的资料，萧家所在的城市名为乌坦城，乌坦城隶属于加玛帝国，虽然此城因为背靠魔兽山脉的地利，而跻身进入帝国的大城市之列，不过也仅仅只是居于末座。\\n\\n    萧炎的家族，在乌坦城颇有份量，不过却也并不是唯一，城市中，还有另外两大家族实力与萧家相差无几，三方彼此明争暗斗了几十年，也未曾分出胜负…\\n\\n    如果说萧家是乌坦城的一霸，那么萧炎口中所说的云岚宗，或许便应该说是整个加玛帝国的一霸！这之间的差距，犹如鸿沟，也难怪连平日严肃的父亲，在言语上很是敬畏。\\n\\n    “他们来我们家族做什么？”萧炎有些疑惑的低声询问道。\\n\\n    移动的纤细指尖微微一顿，熏儿沉默了一会，方才道：“或许和萧炎哥哥有关…”\\n\\n    “我？我可没和他们有过什么交集啊？”闻言，萧炎一怔，摇头否认。', metadata={'source': 'G:\\\\云岚宗.txt'})]"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "vectordb_load.similarity_search(\"云岚宗\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#####  MMR 检索\n",
    "如果只考虑检索出内容的相关性会导致内容过于单一，可能丢失重要信息。\n",
    "\n",
    "最大边际相关性 (MMR, Maximum marginal relevance) 可以帮助我们在保持相关性的同时，增加内容的丰富度。\n",
    "\n",
    "核心思想是在已经选择了一个相关性高的文档之后，再选择一个与已选文档相关性较低但是信息丰富的文档。这样可以在保持相关性的同时，增加内容的多样性，避免过于单一的结果。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[INFO] [03-18 21:57:37] openapi_requestor.py:316 [t:11216]: requesting llm api endpoint: /embeddings/embedding-v1\n",
      "Number of requested results 20 is greater than number of elements in index 7, updating n_results = 7\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[Document(page_content='第四章 云岚宗\\n\\n大厅中,萧战以及三位长老,正在颇为热切的与那位陌生老者交谈着,不过这位老者似乎有什么难以启齿的事情一般,每每到口的话语,都将会有些无奈的咽了回去,而每当这个时候,一旁的娇贵少女,都是忍不住的横了老者一眼…\\n\\n    倾耳听了一会，萧炎便是有些无聊的摇了摇头…\\n\\n    “萧炎哥哥，你知道他们的身份吗？”就在萧炎无聊得想要打瞌睡之时，身旁的熏儿，纤指再次翻开古朴的书页，目不斜视的微笑道。\\n\\n    “你知道？”好奇的转过头来，萧炎惊诧的问道。\\n\\n    “看见他们袍服袖口处的云彩银剑了么？”微微一笑，熏儿道。\\n\\n    “哦？”心头一动，萧炎目光转向三人袖口，果然是发现了一道云彩形状的银剑。\\n\\n    “他们是云岚宗的人？”萧炎惊讶的低声道。', metadata={'source': 'G:\\\\云岚宗.txt'}),\n",
       " Document(page_content='虽然并没有外出历练，不过萧炎在一些书籍中却看过有关这剑派的资料，萧家所在的城市名为乌坦城，乌坦城隶属于加玛帝国，虽然此城因为背靠魔兽山脉的地利，而跻身进入帝国的大城市之列，不过也仅仅只是居于末座。\\n\\n    萧炎的家族，在乌坦城颇有份量，不过却也并不是唯一，城市中，还有另外两大家族实力与萧家相差无几，三方彼此明争暗斗了几十年，也未曾分出胜负…\\n\\n    如果说萧家是乌坦城的一霸，那么萧炎口中所说的云岚宗，或许便应该说是整个加玛帝国的一霸！这之间的差距，犹如鸿沟，也难怪连平日严肃的父亲，在言语上很是敬畏。\\n\\n    “他们来我们家族做什么？”萧炎有些疑惑的低声询问道。\\n\\n    移动的纤细指尖微微一顿，熏儿沉默了一会，方才道：“或许和萧炎哥哥有关…”\\n\\n    “我？我可没和他们有过什么交集啊？”闻言，萧炎一怔，摇头否认。', metadata={'source': 'G:\\\\云岚宗.txt'}),\n",
       " Document(page_content='望着熏儿的躲避态势，萧炎只得无奈的撇了撇嘴，熏儿虽然也姓萧，不过与他却没有半点血缘关系，而且熏儿的父母，萧炎也从未见过，每当他询问自己的父亲时，满脸笑容的父亲便会立刻闭口不语，显然对熏儿的父母很是忌讳，甚至…惧怕！\\n\\n    在萧炎心中，熏儿的身份，极为神秘，可不管他如何侧面询问，这小妮子都会机灵的以沉默应对，让得萧炎就算有计也是无处可施。\\n\\n    “唉，算了，懒得管你，不说就不说吧…”摇了摇头，萧炎的脸色忽然阴沉了下来，因为对面那在纳兰嫣然不断示意的眼色下，那位老者，终于是站起来了…\\n\\n    “呵呵，借助着云岚宗向父亲施威么？这纳兰嫣然，真是好手段呐…”萧炎的心头，响起了愤怒的冷笑。', metadata={'source': 'G:\\\\云岚宗.txt'}),\n",
       " Document(page_content='“这老头还的确桀得可爱…”听到此处，萧炎也是忍不住笑着摇了摇头。\\n\\n    “纳兰桀在家族中拥有绝对的话语权，他说的话，一般都没人敢反对，虽然他也很疼爱纳兰嫣然这孙女，不过想要他开口解除婚约，却是有些困难…”熏儿美丽的眼睛微弯，戏谑道：“可五年之前，纳兰嫣然被云岚宗宗主云韵亲自收做弟子，五年间，纳兰嫣然表现出了绝佳的修炼天赋，更是让得云韵对其宠爱不已…当一个人拥有了改变自己命运的力量时候，那么她会想尽办法将自己不喜欢的事，解决掉…很不幸的，萧炎哥哥与她的婚事，便是让她最不满意的地方！”\\n\\n    “你是说，她此次是来解除婚约的？”', metadata={'source': 'G:\\\\云岚宗.txt'})]"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "vectordb_load.max_marginal_relevance_search(\"云岚宗\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### classwork 2\n",
    "\n",
    "1. 基于pdf文档的导入与分割后，向量化对应文档并将其存入向量数据库并持久化（建议不同文档不同目录），然后尝试通过相似性进行检索\n",
    "\n",
    "2. 对在线新闻实现上题一样的操作，并尝试通过相似性进行检索"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 构造检索式问答链\n",
    "\n",
    "\n",
    "* 四种文档链：Stuff；Refine ；Map reduce；Map re-rank\n",
    "\n",
    "https://python.langchain.com.cn/docs/modules/chains/document/\n",
    "\n",
    "\n",
    "* Stuff documents\n",
    "\n",
    "https://python.langchain.com.cn/docs/modules/chains/document/stuff"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.chains.combine_documents import create_stuff_documents_chain\n",
    "from langchain.chains import create_retrieval_chain"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 创建提示词模板\n",
    "prompt = ChatPromptTemplate.from_template(\"\"\"使用下面的语料来回答本模板最末尾的问题。如果你不知道问题的答案，直接回答 \"我不知道\"，禁止随意编造答案。\n",
    "        为了保证答案尽可能简洁，你的回答必须不超过三句话，你的回答中不可以带有星号。\n",
    "        请注意！在每次回答结束之后，你都必须接上 \"感谢你的提问\" 作为结束语\n",
    "        以下是一对问题和答案的样例：\n",
    "            请问：秦始皇的原名是什么\n",
    "            秦始皇原名嬴政。感谢你的提问。\n",
    "        \n",
    "        以下是语料：\n",
    "<context>\n",
    "{context}\n",
    "</context>\n",
    "\n",
    "Question: {input}\"\"\")\n",
    "#创建检索链\n",
    "document_chain = create_stuff_documents_chain(model_qf, prompt)\n",
    "\n",
    "retriever = vectordb_load.as_retriever()\n",
    "retrieval_chain = create_retrieval_chain(retriever, document_chain)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[INFO] [03-19 09:44:58] openapi_requestor.py:316 [t:22552]: requesting llm api endpoint: /embeddings/embedding-v1\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[INFO] [03-19 09:44:58] openapi_requestor.py:316 [t:15196]: requesting llm api endpoint: /chat/eb-instant\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "根据语料信息，萧炎是一位有潜力的强者，但目前并没有外出历练。他在乌坦城和加玛帝国有一定的地位，但并不是唯一的一家。他口中所说的云岚宗在加玛帝国是一霸，这表明他有一定的实力和背景。此外，他对于纳兰嫣然的解除婚约请求感到愤怒，这表明他有一定的自尊心和保护自己尊严的决心。因此，可以说萧炎是一位有潜力、有背景、有自尊心的强者。\n"
     ]
    }
   ],
   "source": [
    "response = retrieval_chain.invoke({\n",
    "    \"input\": \"萧炎是怎样的强者?\"\n",
    "})\n",
    "print(response[\"answer\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 构建一个检索式问答RAG的全过程"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[INFO] [03-19 09:45:05] openapi_requestor.py:316 [t:3856]: requesting llm api endpoint: /embeddings/embedding-v1\n"
     ]
    }
   ],
   "source": [
    "# 构建一个检索式问答RAG的全过程\n",
    "\n",
    "model_qf = QianfanLLMEndpoint(temperature=0.1)\n",
    "loader_txt = TextLoader(r'G:\\云岚宗.txt', encoding='utf8')\n",
    "docs_txt = loader_txt.load()\n",
    "text_splitter_txt = RecursiveCharacterTextSplitter(chunk_size = 384, chunk_overlap = 0, separators=[\"\\n\\n\", \"\\n\", \" \", \"\", \"。\", \"，\"])\n",
    "documents_txt = text_splitter_txt.split_documents(docs_txt)\n",
    "embeddings_qf = QianfanEmbeddingsEndpoint()\n",
    "vectordb = Chroma.from_documents(documents=documents_txt, embedding=embeddings_qf, persist_directory=\"G:\\\\\" )\n",
    "\n",
    "prompt = ChatPromptTemplate.from_template(\"\"\"使用下面的语料来回答本模板最末尾的问题。如果你不知道问题的答案，直接回答 \"我不知道\"，禁止随意编造答案。\n",
    "        为了保证答案尽可能简洁，你的回答必须不超过三句话，你的回答中不可以带有星号。\n",
    "        请注意！在每次回答结束之后，你都必须接上 \"感谢你的提问\" 作为结束语\n",
    "        以下是一对问题和答案的样例：\n",
    "            请问：秦始皇的原名是什么\n",
    "            秦始皇原名嬴政。感谢你的提问。\n",
    "        \n",
    "        以下是语料：\n",
    "<context>\n",
    "{context}\n",
    "</context>\n",
    "\n",
    "Question: {input}\"\"\")\n",
    "#创建检索链\n",
    "document_chain = create_stuff_documents_chain(model_qf, prompt)\n",
    "\n",
    "retriever = vectordb.as_retriever()\n",
    "retrieval_chain = create_retrieval_chain(retriever, document_chain)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 构建一个检索式文档对话模型"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.memory import ConversationBufferMemory, ConversationSummaryMemory\n",
    "from langchain.chains import ConversationalRetrievalChain\n",
    "\n",
    "memory = ConversationSummaryMemory(\n",
    "    llm=model_qf, memory_key=\"chat_history\", return_messages=True\n",
    ")\n",
    "qa = ConversationalRetrievalChain.from_llm(llm=model_qf, retriever=retriever, memory=memory)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[INFO] [03-19 09:46:37] openapi_requestor.py:316 [t:3856]: requesting llm api endpoint: /chat/eb-instant\n",
      "[INFO] [03-19 09:46:37] openapi_requestor.py:316 [t:3856]: requesting llm api endpoint: /embeddings/embedding-v1\n",
      "[INFO] [03-19 09:46:37] openapi_requestor.py:316 [t:3856]: requesting llm api endpoint: /chat/eb-instant\n",
      "[INFO] [03-19 09:46:40] openapi_requestor.py:316 [t:3856]: requesting llm api endpoint: /chat/eb-instant\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "根据原文信息得出，萧炎是故事中的一个人物，是主人公萧炎的昵称。文中提到，虽然萧炎并没有外出历练，但他通过一些书籍了解到了有关剑派的资料，并且他的家族位于乌坦城，这是一个隶属于加玛帝国的大城市。文中还提到，萧炎的家族在乌坦城颇有份量，但并不是唯一的家族，还有另外两个实力与萧家相差无几的家族。文中还提到，萧炎口中所说的云岚宗可能是整个加玛帝国的一霸，这之间的差距犹如鸿沟。此外，文中还提到了熏儿和纳兰嫣然等人物。因此可以得出结论，萧炎是故事中的主人公。\n"
     ]
    }
   ],
   "source": [
    "res = qa.invoke(\n",
    "    {\"question\": \"萧炎是谁\"}\n",
    ")\n",
    "print(res[\"answer\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[INFO] [03-18 23:09:44] openapi_requestor.py:316 [t:11216]: requesting llm api endpoint: /chat/eb-instant\n",
      "[INFO] [03-18 23:09:44] openapi_requestor.py:316 [t:11216]: requesting llm api endpoint: /embeddings/embedding-v1\n",
      "[INFO] [03-18 23:09:45] openapi_requestor.py:316 [t:11216]: requesting llm api endpoint: /chat/eb-instant\n",
      "[INFO] [03-18 23:09:47] openapi_requestor.py:316 [t:11216]: requesting llm api endpoint: /chat/eb-instant\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "根据原文信息得出，萧炎还没有结婚。文中提到，纳兰嫣然是纳兰桀的孙女，也是萧炎的未婚妻，但文中提到纳兰桀这老头不仅性子桀骜，而且为人又极其在乎承喏，当年的婚事，是他亲口应下来的，所以就算萧炎最近几年名声极差，他也未曾派人过来悔婚。因此可以得出结论，萧炎目前还没有结婚。\n"
     ]
    }
   ],
   "source": [
    "res = qa.invoke(\n",
    "    {\"question\": \"他结婚了吗？\"}\n",
    ")\n",
    "print(res[\"answer\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### classwork 3\n",
    "\n",
    "1. 基于上面新闻页面实现检索式的文档问答\n",
    "\n",
    "2. 基于上面新闻页面实现基于检索式的聊天问答"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
